import numpy as np
import pandas as pd
import plotly.express as plx
import plotly.graph_objects as go
This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).
Only 14 attributes used:
Complete attribute documentation:
1 id: patient identification number
2 ccf: social security number (I replaced this with a dummy value of 0)
3 age: age in years
4 sex: sex (1 = male; 0 = female)
5 painloc: chest pain location (1 = substernal; 0 = otherwise)
6 painexer (1 = provoked by exertion; 0 = otherwise)
7 relrest (1 = relieved after rest; 0 = otherwise)
8 pncaden (sum of 5, 6, and 7)
9 cp: chest pain type
-- Value 1: typical angina
-- Value 2: atypical angina
-- Value 3: non-anginal pain
-- Value 4: asymptomatic
10 trestbps: resting blood pressure (in mm Hg on admission to the hospital)
11 htn
12 chol: serum cholesterol in mg/dl
13 smoke: I believe this is 1 = yes; 0 = no (is or is not a smoker)
14 cigs (cigarettes per day)
15 years (number of years as a smoker)
16 fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
17 dm (1 = history of diabetes; 0 = no such history)
18 famhist: family history of coronary artery disease (1 = yes; 0 = no)
19 restecg: resting electrocardiographic results
-- Value 0: normal
-- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
-- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
20 ekgmo (month of exercise ECG reading)
21 ekgday(day of exercise ECG reading)
22 ekgyr (year of exercise ECG reading)
23 dig (digitalis used during exercise ECG: 1 = yes; 0 = no)
24 prop (Beta blocker used during exercise ECG: 1 = yes; 0 = no)
25 nitr (nitrates used during exercise ECG: 1 = yes; 0 = no)
26 pro (calcium channel blocker used during exercise ECG: 1 = yes; 0 = no)
27 diuretic (diuretic used during exercise ECG: 1 = yes; 0 = no)
28 proto: exercise protocol
1 = Bruce
2 = Kottus
3 = McHenry
4 = fast Balke
5 = Balke
6 = Noughton
7 = bike 150 kpa min/min (Not sure if "kpa min/min" is what was written!)
8 = bike 125 kpa min/min
9 = bike 100 kpa min/min
10 = bike 75 kpa min/min
11 = bike 50 kpa min/min
12 = arm ergometer
29 thaldur: duration of exercise test in minutes
30 thaltime: time when ST measure depression was noted
31 met: mets achieved
32 thalach: maximum heart rate achieved
33 thalrest: resting heart rate
34 tpeakbps: peak exercise blood pressure (first of 2 parts)
35 tpeakbpd: peak exercise blood pressure (second of 2 parts)
36 dummy
37 trestbpd: resting blood pressure
38 exang: exercise induced angina (1 = yes; 0 = no)
39 xhypo: (1 = yes; 0 = no)
40 oldpeak = ST depression induced by exercise relative to rest
41 slope: the slope of the peak exercise ST segment
-- Value 1: upsloping
-- Value 2: flat
-- Value 3: downsloping
42 rldv5: height at rest
43 rldv5e: height at peak exercise
44 ca: number of major vessels (0-3) colored by fluoroscopy
45 restckm: irrelevant
46 exerckm: irrelevant
47 restef: rest radionuclide ejection fraction
48 restwm: rest wall motion abnormality
0 = none
1 = mild or moderate
2 = moderate or severe
3 = akinesis or dyskinesis
49 exeref: exercise radionuclide ejection fraction
50 exerwm: exercise wall motion
51 thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
52 thalsev: not used
53 thalpul: not used
54 earlobe: not used
55 cmo: month of cardiac catheterization
56 cday: day of cardiac catheterization
57 cyr: year of cardiac catheterization
58 num: diagnosis of heart disease (angiographic disease status)
-- Value 0: < 50% diameter narrowing
-- Value 1: > 50% diameter narrowing
(in any major vessel: attributes 59 through 68 are vessels)
59 lmt
60 ladprox
61 laddist
62 diag
63 cxmain
64 ramus
65 om1
66 om2
67 rcaprox
68 rcadist
69 lvx1: not used
70 lvx2: not used
71 lvx3: not used
72 lvx4: not used
73 lvf: not used
74 cathef: not used
75 junk: not used
76 name: last name of patient (I replaced this with the dummy string "name")
location = "C:/Users/harip/3D Objects/Self_Project/Heart Disease Prediction/processed.cleveland.data"
df_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca' , 'thal', 'num']
df = pd.read_csv(location, names = df_names, header = None)
df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45.0 | 1.0 | 1.0 | 110.0 | 264.0 | 0.0 | 0.0 | 132.0 | 0.0 | 1.2 | 2.0 | 0.0 | 7.0 | 1 |
| 299 | 68.0 | 1.0 | 4.0 | 144.0 | 193.0 | 1.0 | 0.0 | 141.0 | 0.0 | 3.4 | 2.0 | 2.0 | 7.0 | 2 |
| 300 | 57.0 | 1.0 | 4.0 | 130.0 | 131.0 | 0.0 | 0.0 | 115.0 | 1.0 | 1.2 | 2.0 | 1.0 | 7.0 | 3 |
| 301 | 57.0 | 0.0 | 2.0 | 130.0 | 236.0 | 0.0 | 2.0 | 174.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | 1 |
| 302 | 38.0 | 1.0 | 3.0 | 138.0 | 175.0 | 0.0 | 0.0 | 173.0 | 0.0 | 0.0 | 1.0 | ? | 3.0 | 0 |
303 rows × 14 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        303 non-null    object
 12  thal      303 non-null    object
 13  num       303 non-null    int64
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB
Checking for missing values and handling them with a filtering approach.
In this dataset, missing values are represented as '?'.
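Instead of checking column by column as below, one comparison over the whole frame counts the '?' placeholders per column at once. A minimal sketch on a made-up toy frame (the values here are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the Cleveland data: '?' marks a missing value,
# exactly as in processed.cleveland.data (values invented for illustration).
toy = pd.DataFrame({
    "ca":   ["0.0", "?", "3.0", "?"],
    "thal": ["3.0", "7.0", "?", "3.0"],
})

# One elementwise comparison, then a per-column sum of the True entries.
missing_counts = (toy == "?").sum()
print(missing_counts["ca"], missing_counts["thal"])  # 2 1
```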
new_df = df.where(df['age'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new_df = df.where(df['sex'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new_df = df.where(df['cp'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new_df = df.where(df['trestbps'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new_df = df.where(df['chol'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new_df = df.where(df['fbs'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new_df = df.where(df['restecg'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new_df = df.where(df['thalach'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new_df = df.where(df['exang'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new_df = df.where(df['oldpeak'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new_df = df.where(df['slope'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
new_df = df.where(df['ca'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 166 | 52.0 | 1.0 | 3.0 | 138.0 | 223.0 | 0.0 | 0.0 | 169.0 | 0.0 | 0.0 | 1.0 | ? | 3.0 | 0.0 |
| 192 | 43.0 | 1.0 | 4.0 | 132.0 | 247.0 | 1.0 | 2.0 | 143.0 | 1.0 | 0.1 | 2.0 | ? | 7.0 | 1.0 |
| 287 | 58.0 | 1.0 | 2.0 | 125.0 | 220.0 | 0.0 | 0.0 | 144.0 | 0.0 | 0.4 | 2.0 | ? | 7.0 | 0.0 |
| 302 | 38.0 | 1.0 | 3.0 | 138.0 | 175.0 | 0.0 | 0.0 | 173.0 | 0.0 | 0.0 | 1.0 | ? | 3.0 | 0.0 |
Here, we are going to fill the first and last records based on filters over the categorical values seen in the missing records above.
The categorical features are: 1. sex, 2. cp, 3. fbs, 4. restecg, 5. exang, 6. slope, 7. thal and 8. num.
Only a subset of these features is used to filter the data.
Let's see what the 'ca' value should be for our filter.
temp_df = df.where((df['sex'] == 1) & (df['cp'] == 3.0) & (df['fbs'] == 0.0) & (df['oldpeak'] == 0.0) & (df['exang'] == 0.0) & (df['slope'] == 1.0))
temp_df = temp_df.dropna(how = 'any')
temp_df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 64.0 | 1.0 | 3.0 | 140.0 | 335.0 | 0.0 | 0.0 | 158.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 |
| 82 | 39.0 | 1.0 | 3.0 | 140.0 | 321.0 | 0.0 | 2.0 | 182.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 85 | 44.0 | 1.0 | 3.0 | 140.0 | 235.0 | 0.0 | 2.0 | 180.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 86 | 47.0 | 1.0 | 3.0 | 138.0 | 257.0 | 0.0 | 2.0 | 156.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 145 | 47.0 | 1.0 | 3.0 | 108.0 | 243.0 | 0.0 | 0.0 | 152.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 |
| 147 | 41.0 | 1.0 | 3.0 | 112.0 | 250.0 | 0.0 | 0.0 | 179.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 166 | 52.0 | 1.0 | 3.0 | 138.0 | 223.0 | 0.0 | 0.0 | 169.0 | 0.0 | 0.0 | 1.0 | ? | 3.0 | 0.0 |
| 190 | 50.0 | 1.0 | 3.0 | 129.0 | 196.0 | 0.0 | 0.0 | 163.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 263 | 44.0 | 1.0 | 3.0 | 120.0 | 226.0 | 0.0 | 0.0 | 169.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 269 | 42.0 | 1.0 | 3.0 | 130.0 | 180.0 | 0.0 | 0.0 | 150.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 281 | 47.0 | 1.0 | 3.0 | 130.0 | 253.0 | 0.0 | 0.0 | 179.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 302 | 38.0 | 1.0 | 3.0 | 138.0 | 175.0 | 0.0 | 0.0 | 173.0 | 0.0 | 0.0 | 1.0 | ? | 3.0 | 0.0 |
Here we have it: every value in the 'ca' feature is 0.0 except the question marks, i.e. the first and fourth records of the latest 'new_df'.
So we can straight away fill those two records with the value 0.0.
df.loc[((df['sex'] == 1) & (df['cp'] == 3.0) & (df['fbs'] == 0.0) & (df['exang'] == 0.0) & (df['slope'] == 1.0) & (df['ca'] == '?')), 'ca'] = 0.0
The code above replaces the question marks with the value 0.0.
Now the first and last records are handled.
new_df = df.where(df['ca'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 192 | 43.0 | 1.0 | 4.0 | 132.0 | 247.0 | 1.0 | 2.0 | 143.0 | 1.0 | 0.1 | 2.0 | ? | 7.0 | 1.0 |
| 287 | 58.0 | 1.0 | 2.0 | 125.0 | 220.0 | 0.0 | 0.0 | 144.0 | 0.0 | 0.4 | 2.0 | ? | 7.0 | 0.0 |
There are another two records with missing values in the 'ca' feature.
The same filtering approach is used again.
temp_df = df.where((df['sex'] == 1) & (df['cp'] == 4.0) & (df['fbs'] == 1.0) & (df['exang'] == 1.0) & (df['slope'] == 2.0) & (df['num'] == 1.0) )
temp_df = temp_df.dropna(how = 'any')
temp_df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 111 | 56.0 | 1.0 | 4.0 | 125.0 | 249.0 | 1.0 | 2.0 | 144.0 | 1.0 | 1.2 | 2.0 | 1.0 | 3.0 | 1.0 |
| 192 | 43.0 | 1.0 | 4.0 | 132.0 | 247.0 | 1.0 | 2.0 | 143.0 | 1.0 | 0.1 | 2.0 | ? | 7.0 | 1.0 |
After filtering, we have only two records, so the only sensible option is to fill the missing 'ca' value with 1.0, the value of the matching record.
df.loc[((df['sex'] == 1) & (df['cp'] == 4.0) & (df['fbs'] == 1.0) & (df['exang'] == 1.0) & (df['slope'] == 2.0) & (df['num'] == 1.0)), 'ca'] = 1.0
Now the second record is handled.
new_df = df.where(df['ca'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 287 | 58.0 | 1.0 | 2.0 | 125.0 | 220.0 | 0.0 | 0.0 | 144.0 | 0.0 | 0.4 | 2.0 | ? | 7.0 | 0.0 |
The same filtering approach continues.
temp_df = df.where((df['sex'] == 1) & (df['cp'] == 2.0) & (df['fbs'] == 0.0) & (df['exang'] == 0.0) & (df['slope'] == 2.0) & (df['num'] == 0.0) )
temp_df = temp_df.dropna(how = 'any')
temp_df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 78 | 48.0 | 1.0 | 2.0 | 130.0 | 245.0 | 0.0 | 2.0 | 180.0 | 0.0 | 0.2 | 2.0 | 0.0 | 3.0 | 0.0 |
| 115 | 41.0 | 1.0 | 2.0 | 135.0 | 203.0 | 0.0 | 0.0 | 132.0 | 0.0 | 0.0 | 2.0 | 0.0 | 6.0 | 0.0 |
| 287 | 58.0 | 1.0 | 2.0 | 125.0 | 220.0 | 0.0 | 0.0 | 144.0 | 0.0 | 0.4 | 2.0 | ? | 7.0 | 0.0 |
Here, we can replace the missing 'ca' value with 0.0
df.loc[((df['sex'] == 1) & (df['cp'] == 2.0) & (df['fbs'] == 0.0) & (df['exang'] == 0.0) & (df['slope'] == 2.0) & (df['num'] == 0.0) & (df['ca'] == '?') ), 'ca'] = 0.0
Now the third record is also handled.
All four missing values have been filled.
Let's check for any other missing values in the 'ca' feature.
new_df = df.where(df['ca'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Everything is done for the 'ca' feature.
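The manual filter-then-replace steps used for 'ca' above can be condensed into a reusable helper. This is only a sketch on invented toy data; the function name and frame are hypothetical, not part of the original notebook:

```python
import pandas as pd

def impute_by_group_mode(df, target, filters, missing="?"):
    """Fill `missing` entries of `target` with the most common value among
    rows that satisfy `filters` (a dict of column -> required value)."""
    mask = pd.Series(True, index=df.index)
    for col, val in filters.items():
        mask &= df[col] == val
    # Mode of the non-missing values within the filtered group.
    fill = df.loc[mask & (df[target] != missing), target].mode().iloc[0]
    df.loc[mask & (df[target] == missing), target] = fill
    return df

# Toy frame mirroring the 'ca' situation (values invented for illustration).
toy = pd.DataFrame({"cp": [3.0, 3.0, 3.0, 4.0], "ca": ["0.0", "0.0", "?", "1.0"]})
toy = impute_by_group_mode(toy, "ca", {"cp": 3.0})
print(toy["ca"].tolist())  # ['0.0', '0.0', '0.0', '1.0']
```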
Moving on to the next feature, 'thal'.
new_df = df.where(df['thal'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 87 | 53.0 | 0.0 | 3.0 | 128.0 | 216.0 | 0.0 | 2.0 | 115.0 | 0.0 | 0.0 | 1.0 | 0.0 | ? | 0.0 |
| 266 | 52.0 | 1.0 | 4.0 | 128.0 | 204.0 | 1.0 | 0.0 | 156.0 | 1.0 | 1.0 | 2.0 | 0.0 | ? | 2.0 |
temp_df = df.where((df['sex'] == 0.0) & (df['cp'] == 3.0) & (df['fbs'] == 0.0) & (df['exang'] == 0.0) & (df['oldpeak'] == 0.0) & (df['slope'] == 1.0) & (df['num'] == 0.0) )
temp_df = temp_df.dropna(how = 'any')
temp_df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26 | 58.0 | 0.0 | 3.0 | 120.0 | 340.0 | 0.0 | 0.0 | 172.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 87 | 53.0 | 0.0 | 3.0 | 128.0 | 216.0 | 0.0 | 2.0 | 115.0 | 0.0 | 0.0 | 1.0 | 0.0 | ? | 0.0 |
| 94 | 63.0 | 0.0 | 3.0 | 135.0 | 252.0 | 0.0 | 2.0 | 172.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 149 | 60.0 | 0.0 | 3.0 | 102.0 | 318.0 | 0.0 | 0.0 | 160.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 0.0 |
| 210 | 37.0 | 0.0 | 3.0 | 120.0 | 215.0 | 0.0 | 0.0 | 170.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 221 | 54.0 | 0.0 | 3.0 | 108.0 | 267.0 | 0.0 | 2.0 | 167.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 222 | 39.0 | 0.0 | 3.0 | 94.0 | 199.0 | 0.0 | 0.0 | 179.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 |
| 227 | 67.0 | 0.0 | 3.0 | 152.0 | 277.0 | 0.0 | 0.0 | 172.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 0.0 |
| 234 | 54.0 | 0.0 | 3.0 | 160.0 | 201.0 | 0.0 | 0.0 | 163.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 0.0 |
Now, we can replace the missing 'thal' value with 3.0.
df.loc[((df['sex'] == 0.0) & (df['cp'] == 3.0) & (df['fbs'] == 0.0) & (df['exang'] == 0.0) & (df['oldpeak'] == 0.0) & (df['slope'] == 1.0) & (df['num'] == 0.0) & (df['thal'] == '?') ), 'thal'] = 3.0
Now we have handled the missing 'thal' value in the first record of the latest 'new_df'.
new_df = df.where(df['thal'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 266 | 52.0 | 1.0 | 4.0 | 128.0 | 204.0 | 1.0 | 0.0 | 156.0 | 1.0 | 1.0 | 2.0 | 0.0 | ? | 2.0 |
temp_df = df.where((df['sex'] == 1.0) & (df['cp'] == 4.0) & (df['fbs'] == 1.0) &(df['num'] == 2.0) )
temp_df = temp_df.dropna(how = 'any')
temp_df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | 60.0 | 1.0 | 4.0 | 117.0 | 230.0 | 1.0 | 0.0 | 160.0 | 1.0 | 1.4 | 1.0 | 2.0 | 7.0 | 2.0 |
| 236 | 56.0 | 1.0 | 4.0 | 130.0 | 283.0 | 1.0 | 2.0 | 103.0 | 1.0 | 1.6 | 3.0 | 0.0 | 7.0 | 2.0 |
| 266 | 52.0 | 1.0 | 4.0 | 128.0 | 204.0 | 1.0 | 0.0 | 156.0 | 1.0 | 1.0 | 2.0 | 0.0 | ? | 2.0 |
| 299 | 68.0 | 1.0 | 4.0 | 144.0 | 193.0 | 1.0 | 0.0 | 141.0 | 0.0 | 3.4 | 2.0 | 2.0 | 7.0 | 2.0 |
Here, we can handle this missing value of 'thal' with 7.0.
df.loc[((df['sex'] == 1.0) & (df['cp'] == 4.0) & (df['fbs'] == 1.0) & (df['num'] == 2.0) & (df['thal'] == '?') ), 'thal'] = 7.0
new_df = df.where(df['thal'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Everything is done for the 'thal' feature.
Moving on to the next feature, 'num'.
new_df = df.where(df['num'] == '?')
new_df = new_df.dropna(how = 'any')
new_df
| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45.0 | 1.0 | 1.0 | 110.0 | 264.0 | 0.0 | 0.0 | 132.0 | 0.0 | 1.2 | 2.0 | 0.0 | 7.0 | 1 |
| 299 | 68.0 | 1.0 | 4.0 | 144.0 | 193.0 | 1.0 | 0.0 | 141.0 | 0.0 | 3.4 | 2.0 | 2.0 | 7.0 | 2 |
| 300 | 57.0 | 1.0 | 4.0 | 130.0 | 131.0 | 0.0 | 0.0 | 115.0 | 1.0 | 1.2 | 2.0 | 1.0 | 7.0 | 3 |
| 301 | 57.0 | 0.0 | 2.0 | 130.0 | 236.0 | 0.0 | 2.0 | 174.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | 1 |
| 302 | 38.0 | 1.0 | 3.0 | 138.0 | 175.0 | 0.0 | 0.0 | 173.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0 |
303 rows × 14 columns
df['num'].value_counts()
0    164
1     55
2     36
3     35
4     13
Name: num, dtype: int64
In the target variable 'num', 0 means no disease.
The values 1, 2, 3 and 4 indicate disease and represent its severity (stage). Here we treat all of them simply as diseased, i.e. the values 1, 2, 3 and 4 are collapsed to 1, where 1 means diseased.
df.loc[(df['num'] > 0), 'num'] = 1
df
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 1 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45.0 | 1.0 | 1.0 | 110.0 | 264.0 | 0.0 | 0.0 | 132.0 | 0.0 | 1.2 | 2.0 | 0.0 | 7.0 | 1 |
| 299 | 68.0 | 1.0 | 4.0 | 144.0 | 193.0 | 1.0 | 0.0 | 141.0 | 0.0 | 3.4 | 2.0 | 2.0 | 7.0 | 1 |
| 300 | 57.0 | 1.0 | 4.0 | 130.0 | 131.0 | 0.0 | 0.0 | 115.0 | 1.0 | 1.2 | 2.0 | 1.0 | 7.0 | 1 |
| 301 | 57.0 | 0.0 | 2.0 | 130.0 | 236.0 | 0.0 | 2.0 | 174.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | 1 |
| 302 | 38.0 | 1.0 | 3.0 | 138.0 | 175.0 | 0.0 | 0.0 | 173.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0 |
303 rows × 14 columns
Handling missing values is now done.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        303 non-null    object
 12  thal      303 non-null    object
 13  num       303 non-null    int64
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB
Here, the features 'ca' and 'thal' are of object dtype; we have to convert them to float.
df['ca'] = df['ca'].astype(float)
df['thal'] = df['thal'].astype(float)
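As a side note, pd.to_numeric with errors='coerce' is a safer alternative to astype(float) when stray '?' placeholders might remain: they become NaN instead of raising a ValueError. A sketch on made-up values:

```python
import pandas as pd

# Invented values for illustration: one leftover '?' placeholder.
s = pd.Series(["0.0", "3.0", "?"])
converted = pd.to_numeric(s, errors="coerce")  # '?' -> NaN, no exception
print(converted.isna().sum())  # 1
```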
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        303 non-null    float64
 12  thal      303 non-null    float64
 13  num       303 non-null    int64
dtypes: float64(13), int64(1)
memory usage: 33.3 KB
no_dis = df.where(df['num'] == 0)
no_dis = no_dis.dropna(how = 'any')
no_dis.loc[(no_dis['sex'] == 0.0), 'sex'] = 'female'
no_dis.loc[(no_dis['sex'] == 1.0), 'sex'] = 'male'
dis = df.where(df['num'] == 1)
dis = dis.dropna(how = 'any')
dis.loc[(dis['sex'] == 0.0), 'sex'] = 'female'
dis.loc[(dis['sex'] == 1.0), 'sex'] = 'male'
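A note on the filtering style: boolean indexing selects the matching rows directly and avoids the where(...) + dropna(...) round trip (which also coerces integer columns to float). A sketch on invented toy data:

```python
import pandas as pd

# Toy frame (values invented): 'num' is the 0/1 disease label.
toy = pd.DataFrame({"num": [0, 1, 0, 1], "sex": [1.0, 0.0, 0.0, 1.0]})

# Keep only the rows where num == 0, then relabel sex in one map call.
no_dis = toy[toy["num"] == 0].copy()
no_dis["sex"] = no_dis["sex"].map({0.0: "female", 1.0: "male"})
print(no_dis["sex"].tolist())  # ['male', 'female']
```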
fig = go.Figure()
fig.add_trace(go.Histogram(x = no_dis['sex'], name = "No Disease"))
fig.add_trace(go.Histogram(x = dis['sex'] , name = "Diseased"))
fig.update_layout(title_text = "Status of the Patient based on the feature 'sex' ", # title of plot
xaxis_title_text = 'sex', # xaxis label
yaxis_title_text = 'Count') # yaxis label
fig.show()
From the histogram above, we can clearly see that males have a higher likelihood of heart disease.
Nearly one third of the females have heart disease.
Males show more disease than females, so let's look at what's happening among the males.
new_df = df.where(df['sex'] == 1.0)
new_df = new_df.dropna(how = 'any')
plx.scatter(x = new_df['age'], color = new_df['num'])
From the scatter plot above, we can clearly see that the 50-70 age group has more heart disease than those below 50. This confirms that 'age' is an important feature for this problem.
Only one male is above age 70, and he has heart disease.
Let's see what happens for females based on 'age'.
new_df = df.where(df['sex'] == 0.0)
new_df = new_df.dropna(how = 'any')
plx.scatter(x = new_df['age'], color = new_df['num'])
Here, except for one female, all diseased females come from the 50-70 age group.
From these two visualizations we get a clear insight: the 50-70 age group has a higher chance of heart disease.
Let's move on to the feature 'cp' to check whether its value matters for heart disease.
import copy
new_df = copy.deepcopy(df)
new_df.loc[(new_df['num'] == 0), 'num'] = 'Not Diseased'
new_df.loc[(new_df['num'] == 1), 'num'] = 'Diseased'
plx.histogram(new_df, x = 'cp', color = 'num', barmode = 'stack')
Out of 144 people whose 'cp' value is 4, 106 have heart disease, i.e. around three quarters of the people with 'cp' = 4 are diseased.
Moving on to the 'trestbps' feature which is continuous data.
plx.scatter(new_df, x = 'trestbps', color = 'num')
From the visualization above, we can see that 'trestbps' values from 94 to 200 contain both diseased and non-diseased people at almost every data point.
So we can't conclude anything from the 'trestbps' feature.
plx.scatter(new_df, x = 'chol', color = 'num')
Here, the data points from 394 to 564 look like outliers, but we can't remove them: there is no pattern in 'chol' with respect to 'num' (nothing like "as cholesterol increases, the probability of heart disease increases") visible in the scatter plot above.
plx.histogram(new_df, x = 'fbs', color = 'num', barmode = 'stack')
Nearly half of the patients are diseased for both values of fasting blood sugar (0.0 = false, 1.0 = true).
plx.histogram(new_df, x = 'restecg', color = 'num', barmode = 'stack')
Three patients are diseased when the 'restecg' value is 1.0.
More than half are diseased when the 'restecg' value is 2.0.
More than half are not diseased when the 'restecg' value is 0.0.
plx.scatter(new_df, x = 'thalach', color = 'num')
From the scatter plot above, we notice that most patients with a 'thalach' value below 160 are diseased.
The chance of disease decreases as 'thalach' increases.
plx.histogram(new_df, x = 'exang', color = 'num', barmode = 'stack')
Around 30% of patients are diseased when the 'exang' value is 0.0.
Around 75% are diseased when the 'exang' value is 1.0.
oldpeak_unique = list(np.array(new_df['oldpeak'].unique()))
oldpeak_unique.sort()
plx.histogram(new_df, x = 'oldpeak', color = 'num', nbins = 80)
The 'oldpeak' values start from zero, and the percentage of diseased people is lower at the value 0.
Let's visualize this as a line chart after performing some calculations.
oldpeak_non_diseased_percentage = []
oldpeak_diseased_percentage = []
for x in oldpeak_unique:
    oldpeak_df = df.where( (df['oldpeak'] == x) )
    oldpeak_df = oldpeak_df.dropna(how = 'any')
    oldpeak_df_1 = oldpeak_df.where( oldpeak_df['num'] == 1 )
    oldpeak_df_1 = oldpeak_df_1.dropna(how = 'any')
    oldpeak_df_0 = oldpeak_df.where( oldpeak_df['num'] == 0 )
    oldpeak_df_0 = oldpeak_df_0.dropna(how = 'any')
    diseased_percentage = ( len(oldpeak_df_1) / len(oldpeak_df) ) * 100
    non_diseased_percentage = ( len(oldpeak_df_0) / len(oldpeak_df) ) * 100
    oldpeak_diseased_percentage.append(diseased_percentage)
    oldpeak_non_diseased_percentage.append(non_diseased_percentage)
fig = go.Figure()
fig.add_trace(go.Scatter(x = oldpeak_unique,
y = oldpeak_diseased_percentage,
mode = 'lines',
name = 'Diseased') )
fig.update_layout( title = 'Percentage of diseased patient on every oldpeak value')
fig.show()
At the values 0.7, 1.1, 1.3, 2.3 and 3.5 the diseased percentage is 0; otherwise, the percentage of heart disease tends to increase as the 'oldpeak' value increases.
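The per-value percentage loop above can also be collapsed into a single groupby; a sketch on invented toy data, where 'num' is the 0/1 disease label:

```python
import pandas as pd

# Toy data (values invented for illustration).
toy = pd.DataFrame({"oldpeak": [0.0, 0.0, 1.0, 1.0, 1.0],
                    "num":     [0,   1,   1,   1,   0]})

# Mean of a 0/1 label within each group is the diseased fraction;
# multiply by 100 to get the percentage per oldpeak value.
diseased_pct = toy.groupby("oldpeak")["num"].mean() * 100
print(diseased_pct[0.0], round(diseased_pct[1.0], 1))  # 50.0 66.7
```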
plx.histogram(new_df, x = new_df['slope'], color = new_df['num'])
Out of 142 patients with 'slope' = 1, 36 (around 25%) are diseased.
Out of 140 patients with 'slope' = 2, 91 (around 65%) are diseased.
Out of 21 patients with 'slope' = 3, 12 (around 57%) are diseased.
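Breakdowns like these can be computed in one step with pd.crosstab and normalize='index'; a sketch on invented toy data:

```python
import pandas as pd

# Toy data (values invented): each row is a patient.
toy = pd.DataFrame({"slope": [1, 1, 2, 2, 2, 3],
                    "num":   [0, 1, 1, 1, 0, 1]})

# normalize='index' turns each row of counts into fractions,
# so column 1 holds the diseased fraction per slope value.
rates = pd.crosstab(toy["slope"], toy["num"], normalize="index")
print(round(rates.loc[2, 1], 3))  # 0.667
```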
new_df['ca'].value_counts()
0.0    179
1.0     66
2.0     38
3.0     20
Name: ca, dtype: int64
plx.histogram(new_df, x = new_df['ca'], color = new_df['num'])
Around 25% of them were diseased when 'ca' value is 0.
Around 68% of them were diseased when 'ca' value is 1.
Around 81% of them were diseased when 'ca' value is 2.
Around 85% of them were diseased when 'ca' value is 3.
plx.histogram(new_df, x = new_df['thal'], color = new_df['num'])
Around 22% of them were diseased when 'thal' value is 3.
Around 67% of them were diseased when 'thal' value is 6.
Around 76% of them were diseased when 'thal' value is 7.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report
X,Y = df.drop(['num'], axis = 1), df['num']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 35)
rfc = RandomForestClassifier()
rfc.fit(X_train, Y_train)
RandomForestClassifier()
pred_rfc = rfc.predict(X_test)
pred_rfc
array([0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
1, 0, 0, 0, 0, 1, 0, 0, 1, 1], dtype=int64)
np.array(Y_test)
array([0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1,
0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1], dtype=int64)
confusion_matrix(Y_test, pred_rfc)
array([[35, 4],
[11, 26]], dtype=int64)
accuracy_score(Y_test, pred_rfc)
0.8026315789473685
print(classification_report(Y_test, pred_rfc))
precision recall f1-score support
0 0.76 0.90 0.82 39
1 0.87 0.70 0.78 37
accuracy 0.80 76
macro avg 0.81 0.80 0.80 76
weighted avg 0.81 0.80 0.80 76
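As a side note, the f1_score imported earlier (but not used so far) reproduces the per-class f1-score column of this report; a sketch on invented labels:

```python
from sklearn.metrics import f1_score

# Invented labels for illustration: TP = 2, FP = 1, FN = 1,
# so precision = recall = 2/3 and f1 = 2/3.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(round(f1_score(y_true, y_pred), 3))  # 0.667
```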
from sklearn.model_selection import RandomizedSearchCV
parameters_rscv_rfc = {'n_estimators' : [10, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
'max_depth' : [2,3,4,5,6,None],
'max_features' : ['sqrt', 'log2', None],
'criterion' : ['gini', 'entropy', 'log_loss'],
'min_samples_leaf' : np.arange(1, 4),
'bootstrap' : [True, False]}
RSCV_rfc = RandomizedSearchCV(rfc, parameters_rscv_rfc, cv = 7, random_state = 35, n_jobs = -1, n_iter = 40)
RSCV_rfc.fit(X_train, Y_train)
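The FitFailedWarning below is partly caused by the 'log_loss' criterion, which only exists in scikit-learn 1.1 and later; on older versions tree fitting raises KeyError: 'log_loss'. A pure-Python guard (hypothetical helper, not part of the original notebook) keeps the grid version-safe:

```python
def supported_criteria(sklearn_version: str) -> list:
    """Return the criterion values usable on the given scikit-learn version.
    'log_loss' was added to the tree criteria in scikit-learn 1.1."""
    major, minor = (int(p) for p in sklearn_version.split(".")[:2])
    criteria = ["gini", "entropy"]
    if (major, minor) >= (1, 1):
        criteria.append("log_loss")
    return criteria

print(supported_criteria("1.0.2"))  # ['gini', 'entropy']
print(supported_criteria("1.1.0"))  # ['gini', 'entropy', 'log_loss']
```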
C:\Users\harip\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:372: FitFailedWarning:
35 fits failed out of a total of 280.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
All 35 failures (reported in two groups of 14 and 21) end in the same error; the repeated tracebacks are trimmed here:
  File "C:\Users\harip\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 352, in fit
    criterion = CRITERIA_CLF[self.criterion](
KeyError: 'log_loss'
The 'log_loss' criterion was only added in scikit-learn 1.1, so on this older installation every sampled candidate that uses it fails and its score becomes nan.
C:\Users\harip\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:969: UserWarning:
One or more of the test scores are non-finite: [0.81980519 0.86350108 0.8241342 0.85010823 0.85024351 0.8633658
0.82399892 0.82399892 0.85484307 0.85443723 0.84131494 nan
nan 0.85890152 0.8590368 0.72713745 0.85470779 nan
nan 0.84564394 0.84131494 nan 0.8459145 0.74878247
0.85890152 0.86769481 0.81493506 0.80194805 0.84131494 0.86350108
0.86769481 0.83252165 0.85470779 0.85457251 0.85010823 0.83712121
0.85457251 0.85470779 0.85037879 0.80641234]
RandomizedSearchCV(cv=7, estimator=RandomForestClassifier(), n_iter=40,
n_jobs=-1,
param_distributions={'bootstrap': [True, False],
'criterion': ['gini', 'entropy',
'log_loss'],
'max_depth': [2, 3, 4, 5, 6, None],
'max_features': ['sqrt', 'log2', None],
'min_samples_leaf': array([1, 2, 3]),
'n_estimators': [10, 100, 200, 300, 400,
500, 600, 700, 800,
900, 1000]},
random_state=35)
RSCV_rfc.best_estimator_
RandomForestClassifier(max_depth=4, max_features='sqrt', min_samples_leaf=2,
n_estimators=900)
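Besides best_estimator_, a fitted search object also exposes best_params_, best_score_ (the mean cross-validated score of the winning combination) and the full cv_results_ table. A self-contained sketch on synthetic data (the dataset and parameter values here are illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=35)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=35),
    {'max_depth': [2, 3, 4, None], 'min_samples_leaf': [1, 2, 3]},
    n_iter=5, cv=3, random_state=35)
search.fit(X_demo, y_demo)

print(search.best_params_)           # the sampled combination with the best mean CV score
print(round(search.best_score_, 3))  # that combination's mean cross-validated score
# cv_results_ holds per-candidate mean/std test scores and ranks:
print('mean_test_score' in search.cv_results_)
```

Refitting rfc_new by hand, as done below, is therefore optional: best_estimator_ is already refit on the full training set by default (refit=True).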
rfc_new = RandomForestClassifier(max_depth=4, max_features='sqrt', min_samples_leaf=2,
n_estimators=900)
rfc_new.fit(X_train, Y_train)
RandomForestClassifier(max_depth=4, max_features='sqrt', min_samples_leaf=2,
n_estimators=900)
pred_rfc_new = rfc_new.predict(X_test)
pred_rfc_new
array([0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1], dtype=int64)
np.array(Y_test)
array([0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1,
0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1], dtype=int64)
confusion_matrix(Y_test, pred_rfc_new)
array([[35, 4],
[ 9, 28]], dtype=int64)
accuracy_score(Y_test, pred_rfc_new)
0.8289473684210527
print(classification_report(Y_test, pred_rfc_new))
precision recall f1-score support
0 0.80 0.90 0.84 39
1 0.88 0.76 0.81 37
accuracy 0.83 76
macro avg 0.84 0.83 0.83 76
weighted avg 0.83 0.83 0.83 76
The accuracy score of the RandomForestClassifier model before hyperparameter tuning is 80.26%; after tuning it improves to 82.89%.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(n_jobs = -1)
lr.fit(X_train, Y_train)
LogisticRegression(n_jobs=-1)
predict_lr = lr.predict(X_test)
predict_lr
array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1], dtype=int64)
lr.score(X_test, Y_test)
0.8026315789473685
confusion_matrix(Y_test, predict_lr)
array([[33, 6],
[ 9, 28]], dtype=int64)
print(classification_report(Y_test, predict_lr))
precision recall f1-score support
0 0.79 0.85 0.81 39
1 0.82 0.76 0.79 37
accuracy 0.80 76
macro avg 0.80 0.80 0.80 76
weighted avg 0.80 0.80 0.80 76
from sklearn.model_selection import RandomizedSearchCV
parameters_rscv_lr = {'penalty' : ['l1', 'l2','elasticnet', 'none'],
'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
rscv_lr = RandomizedSearchCV(lr, parameters_rscv_lr)
rscv_lr.fit(X_train, Y_train)
C:\Users\harip\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:372: FitFailedWarning:
25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
The 25 failures fall into five groups of 5, each ending in one of the following errors (tracebacks trimmed):
ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.
ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got elasticnet penalty.
ValueError: l1_ratio must be between 0 and 1; got (l1_ratio=None)
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.
ValueError: Only 'saga' solver supports elasticnet penalty, got solver=liblinear.
In other words, half of the sampled solver/penalty combinations are invalid, which is why half of the scores below are nan.
C:\Users\harip\anaconda3\lib\site-packages\sklearn\model_selection\_search.py:969: UserWarning:
One or more of the test scores are non-finite: [0.74454106 nan 0.81951691 0.74463768 nan nan
0.85913043 nan nan 0.83275362]
RandomizedSearchCV(estimator=LogisticRegression(n_jobs=-1),
param_distributions={'penalty': ['l1', 'l2', 'elasticnet',
'none'],
'solver': ['newton-cg', 'lbfgs',
'liblinear', 'sag',
'saga']})
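The failed fits could be avoided by restricting the search space to valid solver/penalty pairs, since param_distributions also accepts a list of dictionaries and each candidate is sampled from one of them. A sketch on synthetic data (names here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=35)

# Each dict pairs a solver only with penalties it actually supports.
valid_space = [
    {'solver': ['newton-cg', 'lbfgs', 'sag'], 'penalty': ['l2']},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2']},
    {'solver': ['saga'], 'penalty': ['l1', 'l2']},
    {'solver': ['saga'], 'penalty': ['elasticnet'], 'l1_ratio': [0.5]},
]

search = RandomizedSearchCV(LogisticRegression(max_iter=1000), valid_space,
                            n_iter=8, random_state=35)
search.fit(X_demo, y_demo)
print(search.best_params_)  # no FitFailedWarning: every sampled combination is valid
```

With this layout every cv_results_ score is finite instead of half of them being nan.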
rscv_lr.best_estimator_
LogisticRegression(n_jobs=-1, penalty='none')
lr_new = LogisticRegression(n_jobs=-1, penalty='none')
lr_new.fit(X_train, Y_train)
LogisticRegression(n_jobs=-1, penalty='none')
predict_lr_new = lr_new.predict(X_test)
predict_lr_new
array([0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,
1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1], dtype=int64)
lr_new.score(X_test, Y_test)
0.8289473684210527
cm = confusion_matrix(Y_test, predict_lr_new)
cm
array([[34, 5],
[ 8, 29]], dtype=int64)
plx.imshow(cm, text_auto= True, height = 500, width = 500)
The accuracy score of the tuned LogisticRegression model is the same as the one we got with the tuned RandomForestClassifier, i.e., 82.89%.
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
X,Y = df.drop(['num'], axis = 1), df['num']
dtc = DecisionTreeClassifier()
dtc = dtc.fit(X_train, Y_train)
column_names = list(X.columns)  # feature names only; the 'num' target is not a feature of the tree
plt.figure(figsize = (7,7), dpi = 500)
tree.plot_tree(dtc, filled = True, feature_names = column_names, class_names = ['Non_Diseased', 'Diseased'])
plt.show()
print("Max_depth Accuracy")
for i in range(2, 10):
    dtc = DecisionTreeClassifier(max_depth = i)
    # Note: fitting on the full dataset (X, Y) means the test rows are seen during training.
    dtc = dtc.fit(X, Y)
    acc = accuracy_score(Y_test, dtc.predict(X_test))
    print(i, '\t', "->", acc * 100)
Max_depth Accuracy
2 	 -> 72.36842105263158
3 	 -> 85.52631578947368
4 	 -> 84.21052631578947
5 	 -> 89.47368421052632
6 	 -> 94.73684210526315
7 	 -> 97.36842105263158
8 	 -> 98.68421052631578
9 	 -> 100.0
In the decision tree above, there are 8 decision nodes and 1 root node. A max depth of 9 gives 100% accuracy, but this number is misleading for two reasons: the tree is deep enough to memorize the training data (overfitting), and in this loop the tree was fit on the full dataset (X, Y), which includes the test rows, so the score is further inflated by data leakage. We have to avoid overfitting by choosing an optimal max_depth value.
dtc_new = DecisionTreeClassifier(max_depth = 3)
dtc_new.fit(X_train, Y_train)
DecisionTreeClassifier(max_depth=3)
plt.figure(figsize = (7,7), dpi = 1550)
tree.plot_tree(dtc_new, filled = True, feature_names = column_names, class_names = ['Non_Diseased', 'Diseased'])
plt.show()
dtc_new.score(X_test, Y_test)
0.8289473684210527
The Accuracy of the model using DecisionTreeClassifier is 82.89%.
dtc_new_predict = dtc_new.predict(X_test)
dtc_new_predict
array([0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1,
0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
1, 0, 0, 0, 0, 1, 0, 0, 1, 1], dtype=int64)
plx.imshow(confusion_matrix(Y_test, dtc_new_predict), text_auto = True,
width = 500, height = 500)
The accuracy scores of the DecisionTreeClassifier, RandomForestClassifier and LogisticRegression models are all around 82%; the performance of the three models is almost the same.
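Single-split accuracies on 76 test rows are fairly noisy, so cross-validation gives a steadier comparison of the three models. A sketch on synthetic data (on the notebook's data you would pass X, Y instead of the demo arrays):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=35)

models = {
    'DecisionTree': DecisionTreeClassifier(max_depth=3, random_state=35),
    'RandomForest': RandomForestClassifier(max_depth=4, n_estimators=200, random_state=35),
    'LogisticRegression': LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A mean and standard deviation per model makes "almost the same" checkable rather than an eyeball judgment.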